11 Knowledge Graph Embedding

A knowledge graph is a graph of data intended to convey knowledge of the real world. Nodes represent entities, while edges represent binary relations between entities. Data graphs are potentially enhanced with schema to provide meaning.

KGs are grounded in the Open World Assumption. The most used representation languages are RDF, RDFS, and OWL. They can represent very large data collections, but they can suffer from incompleteness and noise.

11.1 ML on KG

Tasks possible with KGs are:

Link prediction: predicts missing links between entities, can be regarded as a learning to rank problem
Triple classification: assesses correctness of a statement with respect to a knowledge graph, can be regarded as a binary classification problem
Link-based clustering (for example to do customer segmentation)
Entity matching for duplicate detection or deduplication

We have two different possible perspectives:

KG as input to ML: the goal is to improve the performance in many learning tasks
ML as input to KG: the goal is to improve the KG itself, by creating new facts, generalizations, improving the size, coverage, depth, accuracy and, in general to remove noise and incompleteness

11.1.1 Link prediction

We can employ traditional relational learning:

Logic programming: predict new links from facts and rules
Inductive Logic Programming: predict new links from rules extracted from facts
Rule mining: extracts Horn clauses based on their support in the KG
Graphical Models

The limitations are the limited scalability to KG size and the limited expressiveness.

Given the very large data collections new scalable Machine Learning methods are needed, grounded on numeric-based approaches, like vector embedding.

11.2 Embeddings

KGE models convert data graph into an optimal low-dimensional space, preserving as much as possible graph structural information and properties.

The properties to be retained are:

Symmetry: < Alice marriedTo Bob>
Asymmetry: < Alice childOf Jack>
Inversion: < Alice childOf Jack>, < Jack fatherOf Alice>
Composition: < Alice childOf Jack>, < Jack siblingOf Mary>, < Alice nieceOf Mary>
Hierarchies

11.3 General KB Embedding Pipeline

The goal is to learn embeddings such that the score of a valid positive triple is higher than the score of an invalid negative triple.

11.3.1 Anatomy of a KGE Model

Knowledge Graph \mathcal{G}
Lookup layer
Scoring function for a triple f(t)
- High score = high chances for the triple to be a true fact
Loss function \mathcal{L}
Optimization algorithm
Negatives generation strategy

11.3.2 Scoring functions

TransE: Translating Embeddings f_{TransE} = - ||(e_s + r_p) - e_o||_n
e_s is the vector for the subject
r_p is the vector for the relation
e_o is the vector for the object

The idea is that for a true fact taking the subject vector and adding the vector for the relation, you should end up at a point in the vector space that is very close to the object vector.

RotatE: relations are modeled as rotations in complex space \mathcal{C} using element-wise product between embeddings

f_{RotatE} = - ||e_s r_p - e_o||_n

Models based on tensor factorization:
- RESCAL: f_{RESCAL} = e_s^TW_r e_o; it does not scale well
- DistMul: f_{DistMult} = <r_p, e_s, e_o> (dot product)
- ComplEx: DistMult with dot product in \mathcal{C}
Models based on neural network
- ConvE: reshaping + convolution
- ConvKB: convolutions and dot product
- They are computationally expensive!

11.3.3 Loss functions

Pairwise Margin-Based Hinge Loss (used by TransE) \mathcal{L}(x) = \sum_{t^+ \in G} \sum_{t^- \in C} \max(0, [\gamma + f(t^-; x) - f(t^+; x)])
\max(0, [\gamma + f(t^-; x) - f(t^+; x)]): pays a penalty if score of positive triple is less than the score of synthetic negative by a margin \gamma
f(t^-; x): score assigned to a synthetic negative
f(t^+; x): score assigned to true triple
Cross-Entropy
Binary Cross-Entropy

11.3.4 Negatives Generation

If we assume the Local Closed World Assumption, that is, the KG is only locally complete, we can generate negative by corrupting triples present in the KG. The corruption is employed by changing the subject or the object of a triple, while the predicate is unaltered.

How many negatives for a positive?

Uniform sampling: generate all possible synthetic negatives and sample n negative for each positive t
Complete set: use all possible synthetic negatives. We need to mind the scalability and the fact that we can come up with false negatives (triples actually present in the KG)
1-n scoring: we take batches of (s, r) pair and score it against all entities o \in Train simultaneously, labeling as positives if that pair is in training KG or negatives if that pair is not in training KG

11.3.5 Evaluation Metrics for Link prediction

Mean Rank: (sensitive to outlier) MR = \dfrac{1}{|Q|} \sum_{i = 1}^{|Q|} rank_{(s, p, o)_i}
Mean Reciprocal Rank: (less sensitive to outlier) MRR = \dfrac{1}{|Q|} \sum_{i = 1}^{|Q|} \dfrac{1}{rank_{(s, p, o)_i}}
Hits@N: count positive examples in the first N positions Hits@N = \sum_{i = 1}^{|Q|} 1 \; \text{if} \; \text{rank}_{(s, p, o)_i} \le N